Abstract

Despite the apparent cross-disciplinary interactions among scientific fields, a formal description of their 
evolution is lacking. Here we describe a novel approach to study the dynamics and evolution of scientific 
fields using a network-based analysis. We build an idea network consisting of American Physical Society
Physics and Astronomy Classification Scheme (PACS) numbers as nodes representing scientific concepts. 
Two PACS numbers are linked if there exist publications that reference them simultaneously. We locate 
scientific fields using a community finding algorithm, and describe the time evolution of these fields over 
the course of 1985-2006. The communities we identify map to known scientific fields, and their age 
depends on their size and activity. We expect our approach to quantifying the evolution of ideas to be
relevant for making predictions about the future of science and thus help to guide its development. 

Introduction 

Cross-fertilization between different scientific fields has been recognized for its ability to encourage new 
developments and innovative thinking. For this reason, multidisciplinary approaches to research are 
becoming more popular. Some recent examples include applying physics techniques to the study of 
biological phenomena [1], deriving an understanding of the nature of critical phenomena from renormalization
techniques in particle physics [2], drawing inferences about the early universe from findings in
terrestrial superfluid experiments [3], and using statistical physics to analyze technological and social
systems [4].

In an effort to move beyond anecdotal evidence of the benefit of interdisciplinary discourse for science,
in this paper we study the dynamics of groups, or "communities", of ideas using a statistical physics
approach. We attempt to quantify the evolution of ideas and subdisciplines within physics as they
emerge, interact, merge, stagnate, and disappear. The quest to describe the development of scientific
fields is not new. There have been epidemiological [5,6] and network-based approaches (citation and
collaboration networks) [7-15] aiming to gain insight into the spread of scientific ideas. Recently, the
temporal evolution of several scientific disciplines has been modeled with a coarse-grained approach [16].

Here we build a scientific concept network consisting of American Physical Society PACS numbers as 
nodes representing scientific concepts. The American Institute of Physics (AIP) develops and maintains 
the PACS scheme as a service to the physics community in aiding the classification of scientific literature 
and information retrieval. Two PACS numbers are linked if there exist publications that reference them 
simultaneously. Our approach differs from previous methods in that it provides a direct, unsupervised 
description of scientific fields and uses techniques such as community finding and tracking from the field of 
network physics. This approach provides means to quantify how ideas and movements in science appear 
and fade away. Because this method makes it possible to measure the current and past state of the 
relationship between scientific concepts, it may also help to make predictions about the future of science 
and thus inform efforts to guide its development. In this paper, we address some of the quantitative
questions that this method permits; specifically, we seek to answer questions about the relationship 
between size, lifetime, and activity of scientific fields. 

Various local to global topological measures have been introduced to unveil the organizational principles
of complex networks [17-19]. One such measure is community finding. A number of methods exist to find
the communities that describe the inherent structure or functional units of a network [20-23]. One of these
is CFinder, a clique percolation method (CPM) introduced by Palla et al. [21], which finds overlapping
communities and is especially suitable for studying the evolution of scientific fields since scientific concepts
are often shared among multiple fields. We use this CPM to track the evolution of physics.

Results 

Building the Network 

Data were collected from the American Physical Society's (APS) Physical Review database from 
1977-2007. Journals included in the study are Physical Review Letters, Physical Review A through E,
and Physical Review Special Topics: Accelerators and Beams. Papers in this database contain a list of 
author-assigned PACS codes, where each PACS code refers to a specific topic in physics. PACS itself is 
hierarchical, which is evident in the structure of the codes with up to 5 levels of topic specification. For 
example the PACS code '64.60.aq' has 5 levels where the first digit '6' represents the first level (in this 
case 'condensed matter'), '4' represents the second (e.g. 'equations of state, phase equilibria, and phase 
transitions'), the third and fourth digits '60' together represent the third level (e.g., 'general studies of 
phase transitions') while the last two characters 'aq' carry information pertaining to the fourth and fifth 
levels of specification (e.g. 'specific approaches applied to phase transitions' and 'networks', respectively). 
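
As a minimal sketch of the hierarchy described above, the following Python helpers split a code into its five levels and derive the four-digit node label used later for the network. The function names `pacs_levels` and `node_key` are illustrative and not part of the original analysis, and the sketch assumes codes written in the form 'dd.dd.xx'.

```python
def pacs_levels(code):
    """Split a PACS code such as '64.60.aq' into its five hierarchical levels."""
    first, second, third = code.split(".")
    # First pair: two one-digit levels. Second pair: one two-digit level.
    # Last pair: two one-character levels.
    return [first[0], first[1], second, third[0], third[1]]


def node_key(code):
    """Return the first four digits of a code, e.g. '64.60' for '64.60.aq'."""
    first, second, _ = code.split(".")
    return f"{first}.{second}"


if __name__ == "__main__":
    print(pacs_levels("64.60.aq"))
    print(node_key("64.60.aq"))
```

Truncating to the first four digits is what makes two codes such as '64.60.aq' and '64.60.Cn' collapse onto the same node.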

PACS codes are not static; rather, the coding scheme is periodically updated with the addition and
deletion of codes. In order to (at least partially) account for this effect, the scientific concept network
was constructed such that the nodes in the network represent individual PACS codes using the first four
digits of specification, where changes to the scheme are less probable. This network and the related material
are available on our website [24]. In our network, an edge occurs between two nodes if the two PACS codes
they represent appear in the same paper; one paper in the database often contributes many nodes and
edges to the network. Furthermore, edges are weighted by the number of papers that contain that edge.
We introduce two measures, node and edge cutoffs, to control for noise in the network (see Methods
section).
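
The construction just described can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the input format (one list of PACS codes per paper) and the name `build_network` are hypothetical, and the noise cutoffs are not applied here.

```python
from collections import Counter
from itertools import combinations


def build_network(papers):
    """Build a weighted co-occurrence network from per-paper PACS code lists.

    Returns (node_counts, edge_weights): the number of papers containing each
    four-digit node, and the number of papers containing each node pair.
    """
    node_counts = Counter()
    edge_weights = Counter()
    for codes in papers:
        # Truncate to the first four digits and drop duplicates within a paper.
        nodes = sorted({".".join(code.split(".")[:2]) for code in codes})
        node_counts.update(nodes)
        # Every pair of distinct nodes in the paper contributes one to that edge.
        edge_weights.update(combinations(nodes, 2))
    return node_counts, edge_weights


if __name__ == "__main__":
    toy_papers = [
        ["64.60.aq", "89.75.Hc"],
        ["64.60.Cn", "89.75.Fb", "05.70.Fh"],
    ]
    counts, weights = build_network(toy_papers)
    print(counts)
    print(weights)
```

Sorting the nodes of each paper keeps every edge key in a canonical order, so the same pair is never counted under two different keys.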

The entire PACS network from 1977-2007, after both noise measures were applied, has 803 nodes
and 23707 distinct edges. The degree of a node is the number of edges attached to the node. The weighted
cumulative degree distribution follows a stretched exponential of the form $P(k) \sim \exp[-(k/842)^{\beta}]$, with a fitted exponent $0 < \beta < 1$, as
shown in Fig. 1A. The distribution has a similar form in the unweighted case. The dynamic classification
scheme of the American Physical Society, implemented by the addition, splitting and removal of codes,
may be preventing the formation of large hubs, thus keeping the specification of the codes more useful.
The stretched exponential distribution may be the result of a sublinear-linear attachment type growth [25].

The PACS network also exhibits a weak but apparent hierarchical structure, measured by the dependence
of the clustering coefficient on (unweighted) degree. For a node i, the clustering coefficient is given
as $C_i = 2n_i/[k_i(k_i - 1)]$, where $n_i$ is the number of edges that link the neighbors of node i, and $k_i$ is the
degree of the node. The clustering coefficient for a node is the ratio of the number of triangles through
node i to the possible number of triangles that could pass through node i [26]. A purely hierarchical
network will have a $\langle C \rangle$ that scales as a power of k, $\langle C \rangle \sim k^{-1}$, while a random network will have a
clustering coefficient that is constant with k [26]. For this network, $\langle C(k) \rangle \sim k^{-\alpha}$ with a fitted exponent $\alpha$, shown in Fig. 1B.
This dependence is not surprising given the hierarchical structure of the classification scheme.
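
The local clustering coefficient defined above can be computed directly from an adjacency structure. The sketch below assumes an undirected, unweighted network stored as a dictionary of neighbor sets; the name `clustering` and the convention of returning 0 for nodes with fewer than two neighbors are choices made for this illustration.

```python
from itertools import combinations


def clustering(adj, i):
    """Clustering coefficient of node i: 2 * n_i / (k_i * (k_i - 1)).

    adj maps each node to the set of its neighbors.
    """
    k = len(adj[i])
    if k < 2:
        return 0.0  # no pair of neighbors exists; convention assumed here
    # n_i: number of edges among the neighbors of i.
    n = sum(1 for u, v in combinations(adj[i], 2) if v in adj[u])
    return 2.0 * n / (k * (k - 1))


if __name__ == "__main__":
    toy = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
    for node in sorted(toy):
        print(node, clustering(toy, node))
```

Averaging these values over all nodes of equal degree gives the curve whose dependence on degree is discussed above.
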

Defining Communities in Physics

Papers published between 1985 and 2006 were used to study the community evolution of the network;
1985 appears to be the first year when all journals present (Physical Review E began publication in 1993)
consistently used the PACS scheme, and 2007 was excluded to keep incomplete data out of the
analysis. The journal Physical Review Special Topics: Accelerators and Beams was not included because
of an irregular publishing schedule. After the noise measures were applied, the edge weights were no
longer used, and the network was treated as unweighted in the community evolution
analysis. The data were organized into 44 time bins, with each bin representing a 0.5 year time period.
Once a paper (and the edges and nodes it contains) appears in the analysis, it is assigned a lifetime,
l, of 0 or 2.5 years. This assignment is an attempt to more realistically capture the nature of scientific
dissemination, as well as the delay in time from publication to assimilation by the field. The analysis of
community evolution begins at the time bin subsequent to the lapse of the assigned lifetime. Thus the
first time bin, t = 0, for a paper lifetime of l = 2.5 refers to the latter half of 1987, since we start the
analysis in 1985.
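
The time-bin convention can be checked with a short calculation. The sketch below encodes one reading of the text, namely that a bin labelled t (in years) starts at 1985 plus the paper lifetime plus t; the function names are hypothetical.

```python
def bin_start(t_years, lifetime, first_year=1985.0):
    """Calendar time at which the analysis bin labelled t (in years) starts."""
    return first_year + lifetime + t_years


def describe(t_years, lifetime):
    """Name the half-year covered by the bin, e.g. 'latter half of 1987'."""
    start = bin_start(t_years, lifetime)
    year = int(start)
    half = "first" if start - year < 0.5 else "latter"
    return f"{half} half of {year}"


if __name__ == "__main__":
    print(describe(0.0, 2.5))
    print(describe(9.5, 2.5))
```

Under this reading, t = 0 with a lifetime of 2.5 years gives the latter half of 1987, and t = 9.5 years gives the first half of 1997, which agrees with the time quoted in the legend of Figure 2.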

In order to study the evolution of different fields in physics, one must first find these fields in our 
network. We hypothesize that scientific fields are represented by communities in our PACS network. 
These communities are found using the CFinder algorithm, which is based on a clique percolation method
[21]. Figs. 2 and 3 present examples of the community structure extracted utilizing CFinder.

For each community, the code (using only the first two digits) that encompasses the largest fraction
of nodes in the community was found. Its name, specified by the PACS scheme, is then used to label
the community. If a community has multiple codes that account for the same largest fraction of nodes in
that community, then the community is assigned multiple labels. As shown in Figs. 2 and 3, we observe
that the analysis captures expected scientific connections among fields in physics. For example, in 1997,
particle physics is linked to both general relativity and astrophysics. It is also worthwhile to note the
emergence of biophysics as a community in the 2005 analysis.
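
The labeling rule, including the tie case, can be sketched as follows. The function name `community_labels` and the small lookup table of two-digit names are assumptions for this example; the two names used are taken from the labels in Figure 2.

```python
from collections import Counter


def community_labels(nodes, names):
    """Label a community by its most frequent two-digit PACS prefix.

    nodes: four-digit node labels such as '64.60'.
    names: mapping from two-digit prefix to field name (illustrative subset).
    Ties for the largest fraction yield several labels.
    """
    counts = Counter(node[:2] for node in nodes)
    top = max(counts.values())
    return sorted(names.get(prefix, prefix)
                  for prefix, count in counts.items() if count == top)


if __name__ == "__main__":
    names = {"64": "Phase transitions", "47": "Fluid dynamics"}
    print(community_labels(["64.60", "64.70", "47.20"], names))
    print(community_labels(["64.60", "47.20"], names))
```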

Community Evolution and Dynamics 

In order to track the evolution of scientific fields, after identifying communities at each individual 
time interval, it is necessary to match the communities between adjacent time steps. We implemented 
a community evolution algorithm developed by Palla et al. [27] to match the communities between time
bins (see Methods section). 

To gain a better understanding of the dynamics of evolving communities, we defined two properties of 
each community: size and activity. A value for each of these measures can be assigned to every community 
for each individual time bin. The size s of a community is the number of nodes contained within that 
community at time t. Size can be interpreted as a measure of a community's breadth: communities with 
a small size encompass only a few distinct ideas, while large communities encompass many distinct ideas. 
(The cumulative size distribution was calculated for different times and is displayed in Fig. S1.)

The activity a of a community is defined as the number of papers that contain at least one node from 
that community at time t. As one expects, there is a strong correlation between size and activity (see 
Fig. S2). 
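
Both measures are simple counts, as the following sketch shows. It assumes a community is a set of nodes and that the papers active in a time bin are given as sets of nodes; the function names are illustrative.

```python
def size(community):
    """Size s: the number of nodes in the community."""
    return len(community)


def activity(community, papers):
    """Activity a: the number of papers sharing at least one node with the community."""
    return sum(1 for nodes in papers if community & nodes)


if __name__ == "__main__":
    community = {"64.60", "64.70"}
    papers = [{"64.60", "05.70"}, {"47.20"}, {"64.70"}]
    print(size(community), activity(community, papers))
```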

Next, we study the relationship between the age or lifetime of a community versus its size and activity.
The age τ of a community at time t is simply the number of time bins the community has been present
in the evolution analysis: $\tau = t - t_0 + 1$, where $t_0$ is the time bin in which the community was born. In
order to study the dependence of age on size, in each time bin, the current age τ and size s are recorded.
Using all communities from all time intervals, the median age is calculated for communities with the
same size, as shown in Fig. 4A. There is a trend of τ increasing with size s. Thus, it would appear
that older communities tend to contain more nodes, and that longer lived fields tend to encompass many
distinct ideas. Values for both the Pearson correlation coefficient, p, and the Spearman's rank correlation
coefficient, ρ, were calculated between τ and s using the raw, unbinned data, with $\rho = 1 - \frac{6 \sum_i d_i^2}{N(N^2 - 1)}$, where
N is the number of data points and $d_i$ is the difference in the statistical rank of the corresponding values
for each data point. For l = 2.5, the Pearson correlation coefficient was p = 0.4772 while the Spearman's
ρ was calculated to be ρ = 0.5913.
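
The rank correlation formula above can be written out directly. This sketch assumes no tied values, because the closed form is exact only in that case; data with many ties, such as integer ages and sizes, would need average ranks or a statistics library instead. The function names are illustrative.

```python
def ranks(values):
    """Rank the values from 1 (smallest) to N (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho = 1 - 6 * sum(d_i^2) / (N * (N^2 - 1))."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    print(spearman_rho([1, 2, 3], [1, 3, 2]))
```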

In order to measure the dependence of age on activity, the current age τ is recorded along with the
current activity a of every community in each time step. Because of the wide range of possible values
for activity and noise in the data, the values of a are sorted into 100 equally sized bins. The median
age is calculated for all communities within the same activity interval. There is a trend of τ increasing
with activity, as shown in Fig. 4B, which can be partially understood by the strong correlation between
size and activity. Further, we note an apparent phase transition in activity; as shown in Fig. 4B, after
some critical value, communities tend to be longer lived. This transition also appears for l = 0 (see Fig.
S3). Lifetime as a function of size, τ(s), for l = 0 is shown in Fig. S4. Again, the Pearson correlation
coefficient and the Spearman's rank correlation coefficient were calculated for l = 2.5 using the raw,
unbinned data between τ and a, with p = 0.3283 and ρ = 0.3764.

Discussion 

In this paper, we have developed an approach that enables the quantitative study of the evolution of 
physics fields, specifically by following the dynamical connections between various ideas within physics. 
From our investigation, we have shown that long-lived communities tend to be larger and are associated
with a higher number of papers. 

Our approach opens up an interesting possibility of being able to predict community dynamics and 
impact from the current network structure. Furthermore, this method can be easily adapted to other 
scientific fields using different databases. One such database is INSPEC, which has comprehensive
coverage of research activity in computer science and engineering in addition to physics, and has an 
expert-assigned classification scheme rather than author-based assignments. 

Materials and Methods 

Noise Measures 

A node cutoff is introduced such that in a given time interval a node must appear at least twice to be
included in the network. This measure eliminates many of the typographical errors occurring in the
database. The edge cutoff, however, takes into account the random expectation of two PACS codes co-occurring
in the same paper. For this cutoff, the weight of an edge between nodes i and j, $W_{ij}$, which is
the number of papers that both codes i and j appear in, is compared to the weight expected at random,
$E_{ij} = n_i n_j / N$, where $n_i$ and $n_j$ are the number of papers containing nodes i and j respectively, and N
is the total number of papers present in the time interval. If $W_{ij}/E_{ij} > 1.2$, then the appearance of the
edge is significant compared to random appearance, and we include it in the network.
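
The two cutoffs can be sketched as a single filter. The data layout (dictionaries of node counts and edge weights for one time interval) and the name `significant_edges` are assumptions for this illustration; the thresholds are the ones stated above.

```python
def significant_edges(node_counts, edge_weights, n_papers, min_node=2, ratio=1.2):
    """Keep edges whose weight exceeds the random expectation by the given ratio.

    node_counts: papers per node (n_i); edge_weights: papers per node pair (W_ij);
    n_papers: total number of papers in the time interval (N).
    """
    kept = {}
    for (i, j), weight in edge_weights.items():
        # Node cutoff: both endpoints must appear at least min_node times.
        if node_counts[i] < min_node or node_counts[j] < min_node:
            continue
        expected = node_counts[i] * node_counts[j] / n_papers  # E_ij
        if weight / expected > ratio:
            kept[(i, j)] = weight
    return kept


if __name__ == "__main__":
    counts = {"a": 10, "b": 10, "c": 1, "d": 10}
    weights = {("a", "b"): 2, ("a", "d"): 1, ("a", "c"): 1}
    print(significant_edges(counts, weights, n_papers=100))
```

In the toy data, the pair (a, b) co-occurs twice against an expectation of one and is kept, the pair (a, d) matches its expectation and is dropped, and the pair (a, c) is dropped because node c appears only once.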



CFinder 

The CFinder algorithm is described in detail in Ref. [21]. A community is defined as a union of all
k-cliques (complete subgraphs of size k) that can be reached from each other through a series of adjacent
k-cliques (where adjacency means sharing k - 1 nodes) [21].
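
The definition above can be illustrated with a brute-force sketch. This is not the CFinder implementation: it enumerates every k-node subset, which is only practical for toy graphs and would not scale to a network of this size. The function name and the adjacency format are assumptions.

```python
from itertools import combinations


def k_clique_communities(adj, k):
    """Return communities as unions of k-cliques chained by sharing k - 1 nodes.

    adj maps each node to the set of its neighbors (undirected graph).
    """
    nodes = sorted(adj)
    cliques = [frozenset(c) for c in combinations(nodes, k)
               if all(v in adj[u] for u, v in combinations(c, 2))]
    parent = list(range(len(cliques)))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge cliques that are adjacent, i.e. share exactly k - 1 nodes.
    for a, b in combinations(range(len(cliques)), 2):
        if len(cliques[a] & cliques[b]) == k - 1:
            parent[find(a)] = find(b)

    groups = {}
    for idx, clique in enumerate(cliques):
        groups.setdefault(find(idx), set()).update(clique)
    return list(groups.values())


if __name__ == "__main__":
    edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)]
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    print(k_clique_communities(adj, 3))
```

In the toy graph, node 4 belongs to both resulting communities, which illustrates the overlap property that makes the method suitable for concepts shared among fields.
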

Picking a k value

For this study, k = 9 was principally used (for l = 2.5) because it appears to produce a large number of
communities while discouraging the formation of giant communities. Further, by keeping k constant, we
keep the resolution constant for the entire analysis. Picking an appropriate k value for the analysis is
done by considering two properties: the number of communities present, and the presence of overly large
communities [21]. It is desirable to have a large number of communities, so as to increase the statistical
quality of measurements made on the network. Fig. S6 plots the number of communities present at each
time step for k = 8, 9, and 10, for l = 2.5. As demonstrated, the number of communities found using the
choice of k = 10 tends to be smaller than for the other parameter choices, making it less favorable in terms of
improving statistical quality.

A k value must also be large enough to avoid the introduction of overly large communities that obscure
the actual community structure of the network [21]. To quantify this property, we use the quantity r,
which is the ratio of the size of the largest community to that of the second largest community for a given time
bin. Thus, while some distribution in the sizes of communities is necessary, r should not be overly large.
Fig. S7 plots the measure r against all time bins for l = 2.5. For k = 8, the values of r tend to be larger
(signifying giant communities) than those calculated from the other two parameter values, making
it an unfavorable parameter choice.
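
The ratio r is straightforward to compute for one time bin. In this sketch, the name `size_ratio` and the choice to return `None` when fewer than two communities exist are assumptions made for the example.

```python
def size_ratio(communities):
    """Ratio r of the largest community size to the second largest."""
    sizes = sorted((len(c) for c in communities), reverse=True)
    if len(sizes) < 2:
        return None  # r is undefined with fewer than two communities
    return sizes[0] / sizes[1]


if __name__ == "__main__":
    print(size_ratio([{1, 2, 3, 4, 5, 6}, {1, 2, 3}, {7, 8}]))
```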

Community Matching 

The community matching algorithm is described in detail in Ref. [27]. In this analysis, an appropriate
k-value is used rather than a constant edge-weight cutoff. A running stationarity measure is described in
Appendix S1 and Figure S5. The merger of two communities is described in Appendix S1 and Figures
S8 and S9.

Figure Legends

Figure 1. Measurements on the PACS network from 1977-2007. A) Cumulative degree
distribution P(k) of the PACS network. The red line is a fit to the data. Both the weighted and
unweighted cases follow a stretched exponential distribution. B) Average clustering vs degree for
the PACS network, demonstrating that C(k) has some dependence on degree. Thus, there is
some hierarchical structure present in the network.

[Figure 2 image; community labels shown: 04 General relativity; 14 Particle physics; 31 Electronic structure of atoms/molecules; 32 Atomic properties/interactions with photons; 33 Molecular properties/interactions with photons; 34 Atomic and molecular collisions/interactions; 42 Optics; 47 Fluid dynamics; 52 Plasma physics; 61 Crystallography; 64 Phase transitions; 68 Surfaces/thin films; 71 Electronic structure: bulk materials; 75 Magnetic properties; 78 Optical properties: cond. matter; 83 Rheology; 98 Astrophysics; Statistical physics]

Figure 2. The scientific concept network for the first half of 1997. Nodes corresponding to
scientific fields, as well as node labels and their corresponding fields, are shown in the same color. The
size of a node corresponds to the number of PACS codes contained in that community. Same-color
neighboring nodes have the same label. The thickness of an edge corresponds to the number of PACS codes
shared between communities (the weight of the edge). The community structure is shown at
t = 9.5 years, corresponding to the first half of 1997, using CFinder with l = 2.5 years. Labels are assigned
by looking at the first two digits of the PACS codes that make up the largest fraction of each
community.

[Figure 3 image; community labels shown: 04 General relativity; 11/12/13 Particle physics; 31 Electronic structure of atoms/molecules; 32 Atomic properties/interactions with photons; 47 Fluid dynamics; 61 Crystallography; 62 Mechanical properties: cond. matter; 64 Phase transitions; 68 Surfaces/interfaces; 71 Electronic structure: bulk materials; 74 Superconductivity; 77 Dielectrics/piezoelectrics/ferroelectrics; 78 Optical properties: cond. matter]

Figure 3. The scientific concept network for the first half of 2005. 

Figure 4. For l = 2.5 years, the median lifetime (years) as a function of A) size; B) activity
(a). Error bars represent the 1st and 3rd quartiles respectively. For both sets of data, the Spearman's
rank correlation coefficient, ρ, was computed using the unbinned data.

1 Community Dynamics

Once community structure is established, a variety of different measurements are performed on the 
dynamics of the evolving communities. 

The cumulative community size distribution appears long tailed over one decade, which is robust as
a function of t (years), as shown in Fig. S1.

Fig. S2 plots for all time intervals the size of every community against its activity a for paper lifetime
l = 2.5. There appears to be a positive correlation between the two measures, and this trend is also observed
for l = 0 (not shown).

The dependence of age on activity was measured, and Fig. S3 shows the results for the τ vs. a
measurements for k = 7 with a paper lifetime of l = 0 years, where a values were binned because of the
wide range in a, as well as to reduce noise. In both cases (the one presented here and the one presented
in the letter) there is a trend of age increasing as activity increases (though less apparently for l = 0)
and, as expected given the correlation between a and s, one sees a similar relationship between age and
size, as shown in Fig. S4. Thus, older communities also tend to encompass more publications, a result
that agrees with naive expectation. Further, we note an apparent phase transition in both paper lifetime
cases (more apparent for l = 2.5 than for l = 0); after some critical a, communities tend to be longer lived.

Further understanding of the community dynamics can be gained by studying the volatility of the
evolving communities, a measure of how much communities tend to change between subsequent time
steps. To see this, we define an age dependent running stationarity ζ(τ) based on community correlation
and stationarity presented by Palla et al. [1]. The correlation C(t, t') between two states of the same
community A(t) at times t and t' is

$$C(t, t') = \frac{|A(t) \cap A(t')|}{|A(t) \cup A(t')|} \qquad (1)$$

Then the running stationarity, ζ(τ), of that community is the average correlation between subsequent
time steps up to age τ,

$$\zeta(\tau) = \frac{1}{\tau - 1} \sum_{t'=t_0}^{t_0+\tau-2} C(t', t'+1) \qquad (2)$$
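
The correlation and the running stationarity can be sketched in a few lines. This illustration assumes the states of one community are available as a list of node sets, one per time bin starting at its birth, so that the age equals the length of the list; the function names are hypothetical.

```python
def correlation(a, b):
    """Correlation between two community states: |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b)


def running_stationarity(states):
    """Average correlation between subsequent time steps up to the current age.

    states: node sets A(t0), A(t0 + 1), ..., one per time bin.
    """
    tau = len(states)
    if tau < 2:
        raise ValueError("running stationarity needs an age of at least 2")
    total = sum(correlation(states[i], states[i + 1]) for i in range(tau - 1))
    return total / (tau - 1)


if __name__ == "__main__":
    history = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3, 4}]
    print(running_stationarity(history))
```

A community whose membership never changes has a running stationarity of 1, and lower values indicate more turnover between subsequent time steps.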

The running stationarity, ζ(τ), is plotted against lifetime for every community with τ > 1 along with
its current age at time t, for all t, for l = 0 in Fig. S5. This result is qualitatively similar to results
obtained using randomized correlations. For larger values of l, the distribution shifts to larger values of ζ.

2 Picking a k value 

Throughout the paper, k = 9 is principally used (for l = 2.5) because it appears to produce a large number
of communities while discouraging the formation of giant communities. Further, by keeping k constant,
we keep the resolution constant for the entire analysis. Picking an appropriate k value for the analysis is
done by considering two properties: the number of communities present, and the presence of overly large
communities [5]. It is desirable to have a large number of communities, so as to increase the statistical
quality of measurements made on the network. Fig. S6 plots the number of communities present at each
time step for k = 8, 9, and 10, for l = 2.5. As demonstrated, the number of communities found using the
choice of k = 10 tends to be smaller than for the other parameter choices, making it less favorable in terms of
improving statistical quality.

A k value must also be large enough to avoid the introduction of overly large communities that obscure
the actual community structure of the network. To quantify this property, we use the quantity
r, which is the ratio of the size of the largest community to that of the second largest community for a given time
bin. Thus, while some distribution in the sizes of communities is necessary, r should not be overly large.
Fig. S7 plots the measure r against all time bins for l = 2.5. For k = 8, the values of r tend to become
larger (signifying giant communities) than those calculated from the other two parameter values, making
it an unfavorable parameter choice.



3 Merging of communities 

Lastly, we present an example of a merger between two communities. Tracking a nuclear physics
community, Fig. S8 shows the size of that community as a function of time for k = 9 and a community of
similar nodes with k = 10, using l = 2.5.

With k = 9, it appears that this nuclear physics community abruptly dies at t = 4 years. Increasing
the cohesiveness of communities by increasing to k = 10 demonstrates that a community composed of
similar nodes continues to propagate past this time of apparent death. Thus, it seems that the nuclear
physics community is still present in the network, but has become absorbed by another community.

Fig. S9 plots a community at t = 4 years with the nodes from the nuclear physics community displayed
in green. We can assign a label to this community in the usual manner using the nodes present just before
the apparent death of the nuclear physics community. Doing so, the absorbing community is labeled
'physics of elementary particles and fields: specific reactions and phenomenology' in the time bin
prior to its absorption of the nuclear physics community.

Figure 1. The cumulative size distribution for various times in the network. The distributions appear
long tailed over one decade.

Figure 2. The activity a of each community plotted against its size s for every time
interval (l = 2.5). Notice the positive correlation between a and s.

Figure 3. The median lifetime as a function of activity for k = 7, l = 0. Notice the trend of τ
increasing with activity.

Figure 4. The median lifetime as a function of size for k = 7, l = 0. Notice the trend of τ
increasing with size.

Figure 5. Age of each community (k = 7, l = 0) vs its running stationarity value for all
time bins.

Figure 6. The number of communities present in the network (after the noise measures
have been applied) as a function of time for various k values, with l = 2.5. In order to
improve the statistical quality of the analysis, larger numbers of communities are favorable, making
k = 10 an unfavorable parameter choice.

Figure 7. The ratio r of the size of the largest community present divided by size of the
second largest community for every time bin for l = 2.5. Large r indicates the presence of overly
large communities that obscure the community structure; thus k = 8 is an unfavorable choice of
parameter.

Figure 8. Size of the nuclear physics community vs time for k = 9 and k = 10, using l = 2.5.
While the community appears to die at t = 8 (4 years) for k = 9, a community of similar nodes is seen
to continue beyond the time of apparent death when using the higher community cohesiveness
requirement of k = 10. It is possible then that the nuclear physics community is still present in the
analysis, but has merged with another community.



Figure 9. Merger of the nuclear physics community (green) with another community 
(particle physics: specific reactions and phenomenology) at the time of apparent death, 
t = 8 (4 years) for the nuclear physics community. 



